Discrete Time Series

Lecture 07

Dr. Colin Rundel

Random variable review

Mean and variance of RVs

  • Expected Value

\[ E(X) = \begin{cases} \sum_x x \; P(X = x) & \text{$X$ is discrete}\\ \int_{-\infty}^{\infty} x \; f(x) \; dx & \text{$X$ is continuous} \end{cases} \]

  • Variance

\[ \begin{aligned} Var(X) &= E\Big(\big(X-E(X)\big)^2\Big) = E(X^2)-E(X)^2 \\ &= \begin{cases} \sum_x \big(x - E(X)\big)^2 \; P(X = x) & \text{$X$ is discrete}\\ \int_{-\infty}^{\infty} \big(x-E(X)\big)^2 \; f(x) \; dx & \text{$X$ is continuous} \end{cases} \end{aligned} \]

Covariance of RVs

\[ \begin{aligned} Cov(X,Y) &= E\Big(\big(X-E(X)\big)\big(Y-E(Y)\big)\Big) = E(XY)-E(X)E(Y) \\ &= \begin{cases} \sum_x \big(x - E(X)\big)\big(y - E(Y)\big) \; P(X = x, Y=y) & \text{$X$ is discrete}\\ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \big(x-E(X)\big)\big(y-E(Y)\big) \; f(x,y) \; dx \; dy & \text{$X$ is continuous} \end{cases} \\ \\ Corr(X,Y) &= \frac{Cov(X,Y)}{\sqrt{Var(X)\,Var(Y)}} \end{aligned} \]

Properties of Expected Value

  • Constant

    \(E(c) = c\) if \(c\) is constant

  • Constant Multiplication

    \(E(cX) = cE(X)\)

  • Constant Addition

    \(E(X+c) = E(X)+c\)

  • Addition

    \(E(X+Y) = E(X)+E(Y)\)

  • Subtraction

    \(E(X-Y) = E(X)-E(Y)\)

  • Multiplication

    \(E(XY) = E(X)\,E(Y)\)

    if \(X\) and \(Y\) are independent

Properties of Variance

  • Constant

    \(Var(c) = 0\) if \(c\) is constant

  • Constant Multiplication

    \(Var(cX) = c^2~Var(x)\)

  • Constant Addition

    \(Var(X+c) = Var(X)\)

  • Addition

    \(Var(X+Y) = Var(X)+Var(Y)\)

    if \(X\) and \(Y\) are independent.

  • Subtraction

    \(Var(X-Y) = Var(X)+Var(Y)\)

    if \(X\) and \(Y\) are independent.

Properties of Covariance

  • Constant

    \(Cov(X,c) = 0\) if \(c\) is constant

  • Identity

    \(Cov(X,X) = Var(X)\)

  • Symmetric

    \(Cov(X,Y) = Cov(Y,X)\)

  • Constant Multiplication

    \(Cov(aX, bY) = ab ~ Cov(X,Y)\)

  • Constant Addition

    \(Cov(X+a, Y+b) = Cov(X,Y)\)

  • Distribution

    \(Cov(aX+bY,cV+dW) = ac~Cov(X,V) + ad~Cov(X,W)+bc~Cov(Y,V)+bd~Cov(Y,W)\)

Discrete Time Series

Stationary Processes

A stocastic process (i.e. a time series) is considered to be strictly stationary if the properties of the process are not changed by a shift in origin.

In the time series context this means that the joint distribution of \(\{y_{t_1}, \ldots, y_{t_n}\}\) must be identical to the distribution of \(\{y_{t_1+k}, \ldots, y_{t_n+k}\}\) for any value of \(n\) and \(k\).

Weakly Stationary

Strict stationary is unnecessarily strong / restrictive for many applications, so instead we often opt for weak stationary which requires the following,

  1. The process must have finite variance / second moment \[E(y_t^2) < \infty \text{ for all $t$}\]

  2. The mean of the process must be constant \[E(y_t) = \mu \text{ for all $t$}\]

  3. The cross moment (covariance) may only depends on the lag (i.e. \(t-s\) for \(y_t\) and \(y_s\)) \[Cov(y_t,y_s) = Cov(y_{t+k},y_{s+k}) \text{ for all $t,s,k$}\]

When we say stationary in class we will almost always mean weakly stationary.

Autocorrelation

For a stationary time series, where \(E(y_t)=\mu\) and \(\text{Var}(y_t)=\sigma^2\) for all \(t\), we define the autocorrelation at lag \(k\) as

\[ \begin{aligned} \rho_k &= Cor(y_t, \, y_{t+k}) = \frac{Cov(y_t, y_{t+k})}{\sqrt{Var(y_t)Var(y_{t+k})}} \\ &= \frac{E\left( (y_t-\mu)(y_{t+k}-\mu) \right)}{\sigma^2} \end{aligned} \]

this can be written in terms of the autocovariance function (\(\gamma_k\)) as \[ \begin{aligned} \gamma_k &= \gamma(t,t+k) = Cov(y_t, y_{t+k}) \\ \rho_k &= \frac{\gamma(t,t+k)}{\sqrt{\gamma(t,t) \gamma(t+k,t+k)}} = \frac{\gamma(k)}{\gamma(0)} \end{aligned} \]

Covariance Structure

Based on our definition of a (weakly) stationary process, it implies a covariance of the following structure,

\[ \boldsymbol{\Sigma} = \left( \begin{matrix} \gamma(0) & \gamma(1) & \gamma(2) & \gamma(3) & \cdots & \gamma(n-1) &\gamma(n) \\ \gamma(1) & \gamma(0) & \gamma(1) & \gamma(2) & \cdots & \gamma(n-2) &\gamma(n-1) \\ \gamma(2) & \gamma(1) & \gamma(0) & \gamma(1) & \cdots & \gamma(n-3) &\gamma(n-2) \\ \gamma(3) & \gamma(2) & \gamma(1) & \gamma(0) & \cdots & \gamma(n-4) &\gamma(n-3) \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \gamma(n-1) & \gamma(n-2) & \gamma(n-3) & \gamma(n-4) & \cdots & \gamma(0) & \gamma(1) \\ \gamma(n) & \gamma(n-1) & \gamma(n-2) & \gamma(n-3) & \cdots & \gamma(1) & \gamma(0) \\ \end{matrix} \right) \]

Example - Random walk

Let \(y_t = y_{t-1} + w_t\) with \(y_0=0\) and \(w_t \sim N(0,1)\).

ACF + PACF

Stationary?

Is \(y_t\) stationary?

Partial Autocorrelation - pACF

Given these type of patterns in the autocorrelation we often want to examine the relationship between \(y_t\) and \(y_{t+k}\) with the (linear) dependence of \(y_t\) on \(y_{t+1}\) through \(y_{t+k-1}\) removed.

This is done through the calculation of a partial autocorrelation (\(\alpha(k)\)), which is defined as follows:

\[ \begin{aligned} \alpha(0) &= 1 \\ \alpha(1) &= \rho(1) = Cor(y_t,y_{t+1})\\ &~~\vdots \\ \alpha(k) &= Cor(y_t - P_{t,k}(y_t),~ y_{t+k} - P_{t,k}(y_{t+k})) \end{aligned} \]

where \(P_{t,k}(y)\) is the projection of \(y\) onto the space spanned by \(y_{t+1},\ldots,y_{t+k-1}\).

pACF - Calculation

Let \(\rho(k)\) be the autocorrelation for the process at lag \(k\) then the partial autocorrelation at lag \(k\) will be \(\phi(k,k)\) given by the Durbin-Levinson algorithm,

\[ \phi(k,k) = \frac{ \rho(k) - \sum_{t=1}^{k-1} \phi(k-1, t) \, \rho(k-t) }{ 1 - \sum_{t=1}^{k-1} \phi(k-1, t) \, \rho(t) } \] where \[ \phi(k,t) = \phi(k-1,t) - \phi(k,k) \, \phi(k-1, k-t) \\ \]

Starting with \(\phi(1,1) = \rho(1)\) we can solve iteratively for \(\phi(2,2), \ldots, \phi(k,k)\).

Example - Random walk with drift

Let \(y_t = \delta + y_{t-1} + w_t\) with \(y_0=0\) and \(w_t \sim N(0,1)\).

ACF + PACF

Stationary?

Is \(y_t\) stationary?

Example - Moving Average

Let \(w_t \sim N(0,1)\) and \(y_t = w_{t-1}+w_t\).

ACF + PACF

Stationary?

Is \(y_t\) stationary?

Autoregressive

Let \(w_t \sim N(0,1)\) and \(y_t = y_{t-1} - 0.9 y_{t-2} + w_t\) with \(y_t = 0\) for \(t < 1\).

ACF + PACF

Tidy time series

ts objects

In base R, time series are usually encoded using the ts S3 class,

co2
        Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep
1959 315.42 316.31 316.50 317.56 318.13 318.00 316.39 314.65 313.68
1960 316.27 316.81 317.42 318.87 319.87 319.43 318.01 315.74 314.00
1961 316.73 317.54 318.38 319.31 320.42 319.61 318.42 316.63 314.83
1962 317.78 318.40 319.53 320.42 320.85 320.45 319.45 317.25 316.11
1963 318.58 318.92 319.70 321.22 322.08 321.31 319.58 317.61 316.05
1964 319.41 320.07 320.74 321.40 322.06 321.73 320.27 318.54 316.54
1965 319.27 320.28 320.73 321.97 322.00 321.71 321.05 318.71 317.66
1966 320.46 321.43 322.23 323.54 323.91 323.59 322.24 320.20 318.48
1967 322.17 322.34 322.88 324.25 324.83 323.93 322.38 320.76 319.10
1968 322.40 322.99 323.73 324.86 325.40 325.20 323.98 321.95 320.18
1969 323.83 324.26 325.47 326.50 327.21 326.54 325.72 323.50 322.22
1970 324.89 325.82 326.77 327.97 327.91 327.50 326.18 324.53 322.93
1971 326.01 326.51 327.01 327.62 328.76 328.40 327.20 325.27 323.20
1972 326.60 327.47 327.58 329.56 329.90 328.92 327.88 326.16 324.68
1973 328.37 329.40 330.14 331.33 332.31 331.90 330.70 329.15 327.35
1974 329.18 330.55 331.32 332.48 332.92 332.08 331.01 329.23 327.27
1975 330.23 331.25 331.87 333.14 333.80 333.43 331.73 329.90 328.40
1976 331.58 332.39 333.33 334.41 334.71 334.17 332.89 330.77 329.14
1977 332.75 333.24 334.53 335.90 336.57 336.10 334.76 332.59 331.42
1978 334.80 335.22 336.47 337.59 337.84 337.72 336.37 334.51 332.60
1979 336.05 336.59 337.79 338.71 339.30 339.12 337.56 335.92 333.75
1980 337.84 338.19 339.91 340.60 341.29 341.00 339.39 337.43 335.72
1981 339.06 340.30 341.21 342.33 342.74 342.08 340.32 338.26 336.52
1982 340.57 341.44 342.53 343.39 343.96 343.18 341.88 339.65 337.81
1983 341.20 342.35 342.93 344.77 345.58 345.14 343.81 342.21 339.69
1984 343.52 344.33 345.11 346.88 347.25 346.62 345.22 343.11 340.90
1985 344.79 345.82 347.25 348.17 348.74 348.07 346.38 344.51 342.92
1986 346.11 346.78 347.68 349.37 350.03 349.37 347.76 345.73 344.68
1987 347.84 348.29 349.23 350.80 351.66 351.07 349.33 347.92 346.27
1988 350.25 351.54 352.05 353.41 354.04 353.62 352.22 350.27 348.55
1989 352.60 352.92 353.53 355.26 355.52 354.97 353.75 351.52 349.64
1990 353.50 354.55 355.23 356.04 357.00 356.07 354.67 352.76 350.82
1991 354.59 355.63 357.03 358.48 359.22 358.12 356.06 353.92 352.05
1992 355.88 356.63 357.72 359.07 359.58 359.17 356.94 354.92 352.94
1993 356.63 357.10 358.32 359.41 360.23 359.55 357.53 355.48 353.67
1994 358.34 358.89 359.95 361.25 361.67 360.94 359.55 357.49 355.84
1995 359.98 361.03 361.66 363.48 363.82 363.30 361.94 359.50 358.11
1996 362.09 363.29 364.06 364.76 365.45 365.01 363.70 361.54 359.51
1997 363.23 364.06 364.61 366.40 366.84 365.68 364.52 362.57 360.24
        Oct    Nov    Dec
1959 313.18 314.66 315.43
1960 313.68 314.84 316.03
1961 315.16 315.94 316.85
1962 315.27 316.53 317.53
1963 315.83 316.91 318.20
1964 316.71 317.53 318.55
1965 317.14 318.70 319.25
1966 317.94 319.63 320.87
1967 319.24 320.56 321.80
1968 320.09 321.16 322.74
1969 321.62 322.69 323.95
1970 322.90 323.85 324.96
1971 323.40 324.63 325.85
1972 325.04 326.34 327.39
1973 327.02 327.99 328.48
1974 327.21 328.29 329.41
1975 328.17 329.32 330.59
1976 328.78 330.14 331.52
1977 330.98 332.24 333.68
1978 332.38 333.75 334.78
1979 333.70 335.12 336.56
1980 335.84 336.93 338.04
1981 336.68 338.19 339.44
1982 337.69 339.09 340.32
1983 339.82 340.98 342.82
1984 341.18 342.80 344.04
1985 342.62 344.06 345.38
1986 343.99 345.48 346.72
1987 346.18 347.64 348.78
1988 348.72 349.91 351.18
1989 349.83 351.14 352.37
1990 351.04 352.69 354.07
1991 352.11 353.64 354.89
1992 353.23 354.09 355.33
1993 353.95 355.30 356.78
1994 356.00 357.59 359.05
1995 357.80 359.61 360.74
1996 359.65 360.80 362.38
1997 360.83 362.49 364.34
typeof(co2)
[1] "double"
class(co2)
[1] "ts"
attributes(co2)
$tsp
[1] 1959.000 1997.917   12.000

$class
[1] "ts"

tidyverts

This is an effort headed by Rob Hyndman (of forecast and fpp3 fame) and others to provide a consistent tidydata based framework for working with time series data and models.

Core packages:

  • tsibble - temporal data frames and related tools

  • fable - tidy forecasting / modeling

  • feasts - feature extraction and statistics

  • tsibbledata - sample tsibble data sets

tsibble

A tsibble is a tibble with additional infrastructure for encoding temporal data - specifically a tsibble is a tidy data frame with an index and key where

  • the index is the variable that describes the inherent ordering of the data (from past to present)

  • and the key is one or more variables that define the unit of observation over time

  • each observation should be uniquely identified by the index and key

global_economy

tsibbledata::global_economy
# A tsibble: 15,150 x 9 [1Y]
# Key:       Country [263]
   Country  Code   Year    GDP Growth   CPI Imports Exports Population
   <fct>    <fct> <dbl>  <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>
 1 Afghani… AFG    1960 5.38e8     NA    NA    7.02    4.13    8996351
 2 Afghani… AFG    1961 5.49e8     NA    NA    8.10    4.45    9166764
 3 Afghani… AFG    1962 5.47e8     NA    NA    9.35    4.88    9345868
 4 Afghani… AFG    1963 7.51e8     NA    NA   16.9     9.17    9533954
 5 Afghani… AFG    1964 8.00e8     NA    NA   18.1     8.89    9731361
 6 Afghani… AFG    1965 1.01e9     NA    NA   21.4    11.3     9938414
 7 Afghani… AFG    1966 1.40e9     NA    NA   18.6     8.57   10152331
 8 Afghani… AFG    1967 1.67e9     NA    NA   14.2     6.77   10372630
 9 Afghani… AFG    1968 1.37e9     NA    NA   15.2     8.90   10604346
10 Afghani… AFG    1969 1.41e9     NA    NA   15.0    10.1    10854428
# ℹ 15,140 more rows

vic_elec

tsibbledata::vic_elec
# A tsibble: 52,608 x 5 [30m] <Australia/Melbourne>
   Time                Demand Temperature Date       Holiday
   <dttm>               <dbl>       <dbl> <date>     <lgl>  
 1 2012-01-01 00:00:00  4383.        21.4 2012-01-01 TRUE   
 2 2012-01-01 00:30:00  4263.        21.0 2012-01-01 TRUE   
 3 2012-01-01 01:00:00  4049.        20.7 2012-01-01 TRUE   
 4 2012-01-01 01:30:00  3878.        20.6 2012-01-01 TRUE   
 5 2012-01-01 02:00:00  4036.        20.4 2012-01-01 TRUE   
 6 2012-01-01 02:30:00  3866.        20.2 2012-01-01 TRUE   
 7 2012-01-01 03:00:00  3694.        20.1 2012-01-01 TRUE   
 8 2012-01-01 03:30:00  3562.        19.6 2012-01-01 TRUE   
 9 2012-01-01 04:00:00  3433.        19.1 2012-01-01 TRUE   
10 2012-01-01 04:30:00  3359.        19.0 2012-01-01 TRUE   
# ℹ 52,598 more rows

aus_retail

tsibbledata::aus_retail
# A tsibble: 64,532 x 5 [1M]
# Key:       State, Industry [152]
   State                        Industry `Series ID`    Month Turnover
   <chr>                        <chr>    <chr>          <mth>    <dbl>
 1 Australian Capital Territory Cafes, … A3349849A   1982 Apr      4.4
 2 Australian Capital Territory Cafes, … A3349849A   1982 May      3.4
 3 Australian Capital Territory Cafes, … A3349849A   1982 Jun      3.6
 4 Australian Capital Territory Cafes, … A3349849A   1982 Jul      4  
 5 Australian Capital Territory Cafes, … A3349849A   1982 Aug      3.6
 6 Australian Capital Territory Cafes, … A3349849A   1982 Sep      4.2
 7 Australian Capital Territory Cafes, … A3349849A   1982 Oct      4.8
 8 Australian Capital Territory Cafes, … A3349849A   1982 Nov      5.4
 9 Australian Capital Territory Cafes, … A3349849A   1982 Dec      6.9
10 Australian Capital Territory Cafes, … A3349849A   1983 Jan      3.8
# ℹ 64,522 more rows

as_tsibble()

Existing ts objects or data frames can be converted to a tsibbles easily,

tsibble::as_tsibble(co2)
# A tsibble: 468 x 2 [1M]
      index value
      <mth> <dbl>
 1 1959 Jan  315.
 2 1959 Feb  316.
 3 1959 Mar  316.
 4 1959 Apr  318.
 5 1959 May  318.
 6 1959 Jun  318 
 7 1959 Jul  316.
 8 1959 Aug  315.
 9 1959 Sep  314.
10 1959 Oct  313.
# ℹ 458 more rows
tibble(
  co2 = c(co2),
  t = c(time(co2)) |> tsibble::yearmonth()
) |>
  tsibble::as_tsibble(index = t)
# A tsibble: 468 x 2 [1M]
     co2        t
   <dbl>    <mth>
 1  315. 1970 Feb
 2  316. 1970 Mar
 3  316. 1970 Apr
 4  318. 1970 May
 5  318. 1970 Jun
 6  318  1970 Jul
 7  316. 1970 Aug
 8  315. 1970 Sep
 9  314. 1970 Oct
10  313. 1970 Nov
# ℹ 458 more rows

plotting tsibbles

As the tsibble is basically just a tibble which is just a data frame both base and ggplot plotting methods will work with tsibbles.

tsibble::as_tsibble(co2) |>
  plot()

tsibble::as_tsibble(co2) |>
  ggplot(
    aes(x=lubridate::as_date(index), y=value)
  ) +
    geom_point() +
    geom_line()

autoplot

tsibble::as_tsibble(co2) |>
  autoplot(.vars = vars(value))

Multiple variables

tsibbledata::vic_elec |>
  tsibble::index_by(data = ~ lubridate::as_date(.)) |>
  summarize(
    demand_avg = mean(Demand, na.rm=TRUE),
    temp_avg = mean(Temperature, na.rm=TRUE)
  ) |>
  autoplot(.vars = vars(temp_avg, demand_avg))

gg_tsdisplay()

tsibble::as_tsibble(co2) |>
  feasts::gg_tsdisplay(y = value)

gg_tsdisplay() w/ pACF

tsibble::as_tsibble(co2) |>
  feasts::gg_tsdisplay(y = value, plot_type = "partial")

Example - Australian Wine Sales

Australian total wine sales by wine makers in bottles <= 1 litre. Jan 1980 – Aug 1994.

data(AustralianWine, package="Rssa")
( aus_wine = AustralianWine |> 
    tsibble::as_tsibble() |> 
    filter(key == "Total", !is.na(value))
)
# A tsibble: 176 x 3 [1M]
# Key:       key [1]
      index key   value
      <mth> <chr> <int>
 1 1980 Jan Total 15136
 2 1980 Feb Total 16733
 3 1980 Mar Total 20016
 4 1980 Apr Total 17708
 5 1980 May Total 18019
 6 1980 Jun Total 19227
 7 1980 Jul Total 22893
 8 1980 Aug Total 23739
 9 1980 Sep Total 21133
10 1980 Oct Total 22591
# ℹ 166 more rows

Time series

Autocorrelation plots

Lag plot

feasts::gg_lag(aus_wine, lags = 1:16, geom = "point")

Seasonal plot

feasts::gg_season(aus_wine)

Subseries plot

feasts::gg_subseries(aus_wine)

A model?

l = lm(value ~ lag(value,12), data = aus_wine)
summary(l)

Call:
lm(formula = value ~ lag(value, 12), data = aus_wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-12529.5  -1226.2    116.1   1744.0   6802.8 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.692e+03  9.903e+02   3.728 0.000267 ***
lag(value, 12) 8.684e-01  3.824e-02  22.707  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2594 on 162 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared:  0.7609,    Adjusted R-squared:  0.7594 
F-statistic: 515.6 on 1 and 162 DF,  p-value: < 2.2e-16

Predictions

Observed vs predicted